
Contiguous PA #424

Merged 9 commits into habana_main from dev/mfylcek/contiguous_pa_main_24_10 on Oct 25, 2024

Conversation

@mfylcek mfylcek commented Oct 24, 2024

Contiguous cache fetching to avoid using the costly gather operation. Requires changes in vllm-hpu-extension (HabanaAI/vllm-hpu-extension#17) to work.

This introduces redundant calculations in the decoding phase, but in all tested cases it improves performance over the entire run (5-12%). For even better performance, cache defragmentation is required. Only compatible with the v2 block manager.
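A minimal sketch of the idea, using an illustrative PyTorch cache layout (tensor names and shapes here are assumptions, not the actual vLLM HPU code): instead of gathering scattered block indices out of the KV cache, read one contiguous slice covering every allocated block, trading redundant reads for a contiguous access pattern.

```python
import torch

# Illustrative shapes, not the real vLLM HPU cache layout.
num_cache_blocks, block_size, head_dim = 1024, 128, 128
kv_cache = torch.randn(num_cache_blocks, block_size, head_dim)

# Blocks actually used by the running sequences (typically scattered).
block_ids = torch.tensor([3, 17, 42, 600])

# Gather-based fetch: reads only the blocks needed, but the gather is costly.
gathered = kv_cache.index_select(0, block_ids)

# Contiguous fetch: one dense slice from block 0 up to the highest used block.
# Blocks in between are read redundantly, but the access is contiguous.
contiguous = kv_cache[: block_ids.max() + 1]
```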

@mfylcek mfylcek changed the base branch from main to habana_main October 24, 2024 13:15

xuechendi commented Oct 24, 2024

Hi @mfylcek, we've been following this branch and just ran a test on Gaudi3 with a static batch size of 128.

test script:

python -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 1 --max-num-seqs 128 --disable-log-requests --dtype bfloat16 --block-size 128 --gpu-memory-util 0.9 --num-lookahead-slots 1 --use-v2-block-manager  --max-model-len 4096

# repeat command below 3 times for warming up then get final result
python benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name sonnet --dataset-path ./sonnet.txt --request-rate 512 --num-prompts 56 --port 8080 --sonnet-input-len 1024 --sonnet-output-len 1024 --sonnet-prefix-len 100


xuechendi commented Oct 25, 2024

@mfylcek @michalkuligowski,
our team submitted PR #426, which effectively reduces fragmentation when creating the block_list.

Here is the performance I measured with PR #426:
data deleted

@xuechendi

From observation, after warm-up:

  • the block list with only PR #424 is roughly double-sized, padded with many [-1] entries
  • the block list with PR #424 + PR #426 stays the same size as in the previous run
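A rough illustration of that padding effect (hypothetical helper, not code from either PR): the contiguous path has to cover every block id up to the highest one still in use, filling the gaps with -1, while a defragmented allocation keeps the list at its logical size.

```python
# Hypothetical illustration of block-list padding, not code from PR #424 or #426.
def contiguous_block_list(used_block_ids, pad_id=-1):
    """Cover ids 0..max(used) so the cache can be fetched as one contiguous slice."""
    used = set(used_block_ids)
    return [bid if bid in used else pad_id for bid in range(max(used) + 1)]

fragmented = [0, 2, 5, 9]   # blocks scattered after many alloc/free cycles
compacted = [0, 1, 2, 3]    # the same sequences after defragmentation

print(contiguous_block_list(fragmented))  # [0, -1, 2, -1, -1, 5, -1, -1, -1, 9]
print(contiguous_block_list(compacted))   # [0, 1, 2, 3]
```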

@mfylcek mfylcek added the habana Issues or PRs submitted by Habana Labs label Oct 25, 2024
@michalkuligowski michalkuligowski merged commit 5b7f685 into habana_main Oct 25, 2024
19 checks passed
@michalkuligowski michalkuligowski deleted the dev/mfylcek/contiguous_pa_main_24_10 branch October 25, 2024 12:35
@madamczykhabana madamczykhabana restored the dev/mfylcek/contiguous_pa_main_24_10 branch October 25, 2024 12:42
madamczykhabana added a commit that referenced this pull request Oct 25, 2024
madamczykhabana added a commit that referenced this pull request Oct 25, 2024
@xuechendi xuechendi mentioned this pull request Oct 25, 2024
afierka-intel pushed a commit that referenced this pull request Oct 26, 2024
afierka-intel pushed a commit that referenced this pull request Oct 26, 2024